Python: Data Management¶
How to serialize and de-serialize data in Python. There are many options; we will show the ones that are currently popular and important in data science.
Pickle¶
This is probably the default serialization method used by most Python developers. Its main disadvantage is that it is Python-specific and cannot easily be loaded in other languages. However, it is convenient if your project is Python-only. Note that unpickling data from an untrusted source is unsafe: a malicious pickle can execute arbitrary code.
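As a quick sketch (not from the original notes), pickle round-trips many objects that JSON cannot represent, such as dictionaries with tuple keys and set values:

import pickle

# Types with no JSON equivalent round-trip cleanly with pickle
obj = {('a', 1): {2, 3, 5}, 'letters': frozenset('abc')}
restored = pickle.loads(pickle.dumps(obj))
assert restored == obj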
In [1]:
import pickle
In [2]:
bulbasaur = {
    "id": 1,
    "name": "bulbasaur",
    "base_experience": 64,
    "height": 7,
    "is_default": True,
    "order": 1,
    "weight": 69,
    "abilities": [
        {
            "is_hidden": True,
            "slot": 3,
            "ability": {
                "name": "chlorophyll",
                "url": "http://pokeapi.co/api/v2/ability/34/"
            }
        }
    ]
}
Pickle protocols¶
Objects can be pickled using one of several protocols (0-4 in Python 3.4-3.7; Python 3.8 added protocol 5). In general, use pickle.HIGHEST_PROTOCOL, as it is the most flexible and supports very large objects, unless you need to share data with Python 2, in which case use protocol 2.
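A minimal sketch (not from the original notes) comparing the serialized size of the same object under each available protocol:

import pickle

obj = list(range(1000))
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(obj, protocol))
    print(f'protocol {protocol}: {size} bytes')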
Serialize¶
In [3]:
with open('data/bulbasaur.pickle', 'wb') as f:
    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)
De-serialize¶
In [4]:
with open('data/bulbasaur.pickle', 'rb') as f:
    pokemon = pickle.load(f)
In [5]:
pokemon
Out[5]:
{'abilities': [{'ability': {'name': 'chlorophyll',
'url': 'http://pokeapi.co/api/v2/ability/34/'},
'is_hidden': True,
'slot': 3}],
'base_experience': 64,
'height': 7,
'id': 1,
'is_default': True,
'name': 'bulbasaur',
'order': 1,
'weight': 69}
Serialize to byte string¶
In [6]:
s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)
In [7]:
s
Out[7]:
b'\x80\x04\x95\xd5\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x02id\x94K\x01\x8c\x04name\x94\x8c\tbulbasaur\x94\x8c\x0fbase_experience\x94K@\x8c\x06height\x94K\x07\x8c\nis_default\x94\x88\x8c\x05order\x94K\x01\x8c\x06weight\x94KE\x8c\tabilities\x94]\x94}\x94(\x8c\tis_hidden\x94\x88\x8c\x04slot\x94K\x03\x8c\x07ability\x94}\x94(h\x02\x8c\x0bchlorophyll\x94\x8c\x03url\x94\x8c$http://pokeapi.co/api/v2/ability/34/\x94uuau.'
De-serialize from byte string¶
In [8]:
pokemon2 = pickle.loads(s)
In [9]:
pokemon2
Out[9]:
{'abilities': [{'ability': {'name': 'chlorophyll',
'url': 'http://pokeapi.co/api/v2/ability/34/'},
'is_hidden': True,
'slot': 3}],
'base_experience': 64,
'height': 7,
'id': 1,
'is_default': True,
'name': 'bulbasaur',
'order': 1,
'weight': 69}
Feather¶
Feather is a new and highly optimized binary serialization format for columnar tabular data that is useful for loading and saving large data frames. It can also be used to share large data frames between Python, R, and Julia.
Installation in Python
pip3 install feather-format
Installation in R
install.packages("feather")
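Note that recent versions of pandas also expose Feather directly (backed by pyarrow), so the standalone feather package is no longer strictly required. A minimal sketch using the pandas API, with a made-up example frame:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
df.to_feather('data/example.feather')   # requires pyarrow
df2 = pd.read_feather('data/example.feather')
assert df.equals(df2)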
In [10]:
from pandas_datareader import data
import arrow
import feather
Download data from Google Finance¶
In [11]:
start = arrow.get('2010-01-01')
end = arrow.get('2016-12-31')
tickers = ['AAPL', 'MSFT', 'SPY']
data_source = 'google'
panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)
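Note: the 'google' data source has since been removed from pandas-datareader, so the cell above may fail on current versions. As a fallback sketch with made-up prices, you can construct a comparable close-price frame directly and skip ahead to the serialization steps:

import numpy as np
import pandas as pd

dates = pd.date_range('2010-01-04', periods=5, freq='B')  # business days
rng = np.random.default_rng(0)
close = pd.DataFrame({'Date': dates,
                      'AAPL': 30 + rng.random(5),
                      'MSFT': 30 + rng.random(5),
                      'SPY': 113 + rng.random(5)})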
In [12]:
panel.keys()
Out[12]:
Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')
Format closing prices¶
In [13]:
close = panel.loc['Close']
close = close.reset_index()
close.head()
Out[13]:
|   | Date | AAPL | MSFT | SPY |
|---|---|---|---|---|
| 0 | 2010-01-04 | 30.57 | 30.95 | 113.33 |
| 1 | 2010-01-05 | 30.63 | 30.96 | 113.63 |
| 2 | 2010-01-06 | 30.14 | 30.77 | 113.71 |
| 3 | 2010-01-07 | 30.08 | 30.45 | 114.19 |
| 4 | 2010-01-08 | 30.28 | 30.66 | 114.57 |
Serialize¶
In [14]:
feather.write_dataframe(close, 'data/close.feather')
De-serialize¶
In [15]:
close2 = feather.read_dataframe('data/close.feather')
close2.head()
Out[15]:
|   | Date | AAPL | MSFT | SPY |
|---|---|---|---|---|
| 0 | 2010-01-04 | 30.57 | 30.95 | 113.33 |
| 1 | 2010-01-05 | 30.63 | 30.96 | 113.63 |
| 2 | 2010-01-06 | 30.14 | 30.77 | 113.71 |
| 3 | 2010-01-07 | 30.08 | 30.45 | 114.19 |
| 4 | 2010-01-08 | 30.28 | 30.66 | 114.57 |
Sharing data frames between R and Python¶
In [16]:
%load_ext rpy2.ipython
In [17]:
%%R
library(feather)
close <- read_feather('data/close.feather')
head(close)
# A tibble: 6 x 4
Date AAPL MSFT SPY
<dttm> <dbl> <dbl> <dbl>
1 2010-01-03 19:00:00 30.57 30.95 113.33
2 2010-01-04 19:00:00 30.63 30.96 113.63
3 2010-01-05 19:00:00 30.14 30.77 113.71
4 2010-01-06 19:00:00 30.08 30.45 114.19
5 2010-01-07 19:00:00 30.28 30.66 114.57
6 2010-01-10 19:00:00 30.02 30.27 114.73
Note that the timestamps appear shifted relative to the pandas output: R displays the stored datetimes in the local time zone (here UTC-5), while pandas showed plain dates.
In [18]:
%%R
write_feather(close, 'data/closeR.feather')
In [19]:
close3 = feather.read_dataframe('data/closeR.feather')
close3.head()
Out[19]:
|   | Date | AAPL | MSFT | SPY |
|---|---|---|---|---|
| 0 | 2010-01-04 | 30.57 | 30.95 | 113.33 |
| 1 | 2010-01-05 | 30.63 | 30.96 | 113.63 |
| 2 | 2010-01-06 | 30.14 | 30.77 | 113.71 |
| 3 | 2010-01-07 | 30.08 | 30.45 | 114.19 |
| 4 | 2010-01-08 | 30.28 | 30.66 | 114.57 |
HDF5¶
HDF5 is used to store and retrieve large quantities of hierarchically organized numeric data. It can be annotated to be “self-documenting” and makes efficient use of memory. It is a standard format, so files can be read across languages and platforms.
We show a simple example of saving simulation data.
Saving to disk¶
In [20]:
import h5py
import arrow
import numpy as np
In [21]:
with h5py.File('data/sim.hdf5', 'w') as root:
    root.attrs['author'] = 'Cliburn'
    root.attrs['description'] = 'Simulated data sets for BIOS-821 demo'
    for i in range(4):
        g = root.create_group(f'expt{i:03}')
        g.attrs['created'] = arrow.utcnow().format()
        for j in range(5):
            d = g.create_dataset(f'sim{j:03}', data=np.random.random((1000,1000)))
            d.attrs['modified'] = arrow.utcnow().format()
In [22]:
root = h5py.File('data/sim.hdf5', 'r')
In [23]:
list(root.attrs.keys())
Out[23]:
['author', 'description']
In [24]:
for k, v in root.attrs.items():
    print(f'{k:12}: {v}')
author : Cliburn
description : Simulated data sets for BIOS-821 demo
In [25]:
def show(name):
    print(root[name])
In [26]:
root.visit(show)
<HDF5 group "/expt000" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt001" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt002" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt003" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
Load a specific data set into memory¶
In [27]:
data = root['expt002/sim003']
In [28]:
data.shape
Out[28]:
(1000, 1000)
In [29]:
list(data.attrs.keys())
Out[29]:
['modified']
In [30]:
data.attrs['modified']
Out[30]:
'2017-07-02 18:15:24+00:00'
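Note that indexing into the file as above returns a lazy h5py Dataset backed by disk, not an in-memory array. To materialize the values, slice with [...] (or [:]), which returns a NumPy array:

arr = data[...]   # read the full dataset into memory as a NumPy array
arr.shape, arr.dtype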
Close handler¶
In [31]:
root.close()
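Alternatively, open the file in a with block (as in the writing example above) so that it is closed automatically:

with h5py.File('data/sim.hdf5', 'r') as root:
    print(root['expt000/sim000'].shape)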
JSON¶
JSON (JavaScript Object Notation) is a very simple plain-text standard that is now ubiquitously used for data transfer, especially over the web via Representational State Transfer (REST) APIs.
JSON consists of objects (which resemble Python dictionaries, except that all keys must be strings) and arrays (which resemble Python lists), with values drawn from
- strings - must be double-quoted
- numbers - integers or floats
- booleans - true or false
- null
Objects and arrays can be nested to arbitrary levels.
The json package in the standard library will convert JSON objects
to Python dictionaries and JSON arrays to Python lists, and vice-versa.
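A small sketch (not from the original notes) of the conversions: None maps to null, True/False to true/false, and tuples are serialized as arrays:

import json

s = json.dumps({'a': None, 'b': True, 'c': (1, 2.5)})
print(s)              # {"a": null, "b": true, "c": [1, 2.5]}
print(json.loads(s))  # {'a': None, 'b': True, 'c': [1, 2.5]}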
A JSON Object¶
Note that there must NOT be a comma after the last key: value pair. This is unlike Python dictionaries where trailing commas are OK.
{
    "class": "Cleric",
    "power source": "Divine",
    "role": "Leader"
}
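A quick sketch confirming that the standard json module enforces this:

import json

try:
    json.loads('{"class": "Cleric",}')  # trailing comma is invalid JSON
except json.JSONDecodeError as e:
    print(e)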
A JSON Array¶
Here we show an array of objects, each of which has 3 key-value pairs.
[
    {
        "class": "Cleric",
        "power": "Divine",
        "role": "Leader"
    },
    {
        "class": "Fighter",
        "power": "Martial",
        "role": "Defender"
    },
    {
        "class": "Paladin",
        "power": "Divine",
        "role": "Defender"
    },
    {
        "class": "Ranger",
        "power": "Martial",
        "role": "Striker"
    },
    {
        "class": "Rogue",
        "power": "Martial",
        "role": "Striker"
    },
    {
        "class": "Warlock",
        "power": "Arcane",
        "role": "Striker"
    },
    {
        "class": "Warlord",
        "power": "Martial",
        "role": "Leader"
    },
    {
        "class": "Wizard",
        "power": "Arcane",
        "role": "Controller"
    }
]
JSON and Python¶
In [32]:
import requests
JSON from REST API¶
JSON is particularly important for transferring data over the web. More and more databases now allow you to access their data in a granular fashion using REST API calls. We illustrate by retrieving information about berries that might be used to feed your Pokémon.
Berries are small fruits that can provide HP and status condition restoration, stat enhancement, and even damage negation when eaten by Pokémon. Check out Bulbapedia for greater detail. (Description from PokeAPI.)
In [33]:
response = requests.get('http://pokeapi.co/api/v2/berry/1/')
In [34]:
data = response.json()
In [35]:
data
Out[35]:
{'firmness': {'name': 'soft',
'url': 'http://pokeapi.co/api/v2/berry-firmness/2/'},
'flavors': [{'flavor': {'name': 'spicy',
'url': 'http://pokeapi.co/api/v2/berry-flavor/1/'},
'potency': 10},
{'flavor': {'name': 'dry',
'url': 'http://pokeapi.co/api/v2/berry-flavor/2/'},
'potency': 0},
{'flavor': {'name': 'sweet',
'url': 'http://pokeapi.co/api/v2/berry-flavor/3/'},
'potency': 0},
{'flavor': {'name': 'bitter',
'url': 'http://pokeapi.co/api/v2/berry-flavor/4/'},
'potency': 0},
{'flavor': {'name': 'sour',
'url': 'http://pokeapi.co/api/v2/berry-flavor/5/'},
'potency': 0}],
'growth_time': 3,
'id': 1,
'item': {'name': 'cheri-berry', 'url': 'http://pokeapi.co/api/v2/item/126/'},
'max_harvest': 5,
'name': 'cheri',
'natural_gift_power': 60,
'natural_gift_type': {'name': 'fire',
'url': 'http://pokeapi.co/api/v2/type/10/'},
'size': 20,
'smoothness': 25,
'soil_dryness': 15}
Serializing to disk¶
In [36]:
import json
In [37]:
filename = 'data/{}.json'.format(data['name'])
filename
Out[37]:
'data/cheri.json'
In [38]:
with open(filename, 'w') as f:
    json.dump(data, f)
De-serializing¶
In [39]:
with open(filename) as f:
    cheri = json.load(f)
cheri
Out[39]:
{'firmness': {'name': 'soft',
'url': 'http://pokeapi.co/api/v2/berry-firmness/2/'},
'flavors': [{'flavor': {'name': 'spicy',
'url': 'http://pokeapi.co/api/v2/berry-flavor/1/'},
'potency': 10},
{'flavor': {'name': 'dry',
'url': 'http://pokeapi.co/api/v2/berry-flavor/2/'},
'potency': 0},
{'flavor': {'name': 'sweet',
'url': 'http://pokeapi.co/api/v2/berry-flavor/3/'},
'potency': 0},
{'flavor': {'name': 'bitter',
'url': 'http://pokeapi.co/api/v2/berry-flavor/4/'},
'potency': 0},
{'flavor': {'name': 'sour',
'url': 'http://pokeapi.co/api/v2/berry-flavor/5/'},
'potency': 0}],
'growth_time': 3,
'id': 1,
'item': {'name': 'cheri-berry', 'url': 'http://pokeapi.co/api/v2/item/126/'},
'max_harvest': 5,
'name': 'cheri',
'natural_gift_power': 60,
'natural_gift_type': {'name': 'fire',
'url': 'http://pokeapi.co/api/v2/type/10/'},
'size': 20,
'smoothness': 25,
'soil_dryness': 15}
Relational Database (SQLite3)¶
SQLite3 is part of the standard library. However, the mechanics of using essentially any relational database in Python are similar, because of the Python DB-API (PEP 249).
In [40]:
import sqlite3
In [41]:
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')
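The examples below use SQLite's convenience shortcut of calling execute directly on the connection. The portable DB-API pattern goes through a cursor, as in this minimal sketch:

cur = c.cursor()
cur.execute("SELECT COUNT(*) FROM Album")
print(cur.fetchone())   # a 1-tuple holding the row count
cur.close()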
SQLite specific commands to get metadata¶
Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You’ll have to read the docs to find out what is needed for your SQL flavor.
What tables are there in the database?¶
In [42]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))
Out[42]:
[('Album',),
('Artist',),
('Customer',),
('Employee',),
('Genre',),
('Invoice',),
('InvoiceLine',),
('MediaType',),
('Playlist',),
('PlaylistTrack',),
('Track',)]
What are the columns of the table “Album”?¶
In [43]:
list(c.execute("PRAGMA table_info(Album);"))
Out[43]:
[(0, 'AlbumId', 'INTEGER', 1, None, 1),
(1, 'Title', 'NVARCHAR(160)', 1, None, 0),
(2, 'ArtistId', 'INTEGER', 1, None, 0)]
Standard SQL statements with parameter substitution¶
Note: Using Python string substitution for Python-defined parameters is dangerous because of the risk of SQL injection attacks. Use parameter substitution with ? instead.
Do this¶
In [44]:
t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))
Out[44]:
[(1, 'For Those About To Rock We Salute You', 1),
(4, 'Let There Be Rock', 1),
(59, 'Deep Purple In Rock', 58),
(108, 'Rock In Rio [CD1]', 90),
(109, 'Rock In Rio [CD2]', 90)]
Not this¶
In [45]:
t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))
Out[45]:
[(1, 'For Those About To Rock We Salute You', 1),
(4, 'Let There Be Rock', 1),
(59, 'Deep Purple In Rock', 58),
(108, 'Rock In Rio [CD1]', 90),
(109, 'Rock In Rio [CD2]', 90)]
Using SQL magic functions¶
We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.
In [46]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    %load_ext sql
Configuring the SqlMagic extension¶
In [47]:
%config SqlMagic
SqlMagic options
--------------
SqlMagic.autolimit=<Int>
Current: 0
Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
Current: False
Return Pandas DataFrames instead of regular result sets
SqlMagic.column_local_vars=<Bool>
Current: False
Return data into local variables from column names
SqlMagic.displaylimit=<Int>
Current: 0
Automatically limit the number of rows displayed (full result set is still
stored)
SqlMagic.dsn_filename=<Unicode>
Current: 'odbc.ini'
Path to DSN file. When the first argument is of the form [section], a
sqlalchemy connection string is formed from the matching section in the DSN
file.
SqlMagic.feedback=<Bool>
Current: True
Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
Current: True
Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
Current: 'DEFAULT'
Set the table printing style to any of prettytable's defined styles
(currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)
In [48]:
%config SqlMagic.displaylimit=10
Connect to SQLite3 database¶
See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.
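For example (hypothetical credentials and hosts; the driver packages must be installed):

%sql postgresql://user:password@localhost/mydb
%sql mysql+pymysql://user:password@localhost/mydb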
In [49]:
%sql sqlite:///data/Chinook_Sqlite.sqlite
Out[49]:
'Connected: None@data/Chinook_Sqlite.sqlite'
In [50]:
%sql SELECT * from Album LIMIT 5;
Done.
Out[50]:
| AlbumId | Title | ArtistId |
|---|---|---|
| 1 | For Those About To Rock We Salute You | 1 |
| 2 | Balls to the Wall | 2 |
| 3 | Restless and Wild | 2 |
| 4 | Let There Be Rock | 1 |
| 5 | Big Ones | 3 |
In [51]:
%sql SELECT * from Artist LIMIT 5;
Done.
Out[51]:
| ArtistId | Name |
|---|---|
| 1 | AC/DC |
| 2 | Accept |
| 3 | Aerosmith |
| 4 | Alanis Morissette |
| 5 | Alice In Chains |
In [52]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;
Done.
Out[52]:
| Name | Title |
|---|---|
| AC/DC | For Those About To Rock We Salute You |
| AC/DC | Let There Be Rock |
| Aaron Copland & London Symphony Orchestra | A Copland Celebration, Vol. I |
| Aaron Goldberg | Worlds |
| Academy of St. Martin in the Fields & Sir Neville Marriner | The World of Classical Favourites |
You can assign results of queries to Python names¶
In [53]:
result = %sql SELECT * from Album;
Done.
In [54]:
type(result)
Out[54]:
sql.run.ResultSet
In [55]:
result[2:4]
Out[55]:
[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]
You can use Python variables in your queries.¶
Use :varname where you want to use a Python variable in your query.
In [56]:
artist_id = 10
%sql select * from Artist where ArtistId < :artist_id;
Done.
Out[56]:
| ArtistId | Name |
|---|---|
| 1 | AC/DC |
| 2 | Accept |
| 3 | Aerosmith |
| 4 | Alanis Morissette |
| 5 | Alice In Chains |
| 6 | Antônio Carlos Jobim |
| 7 | Apocalyptica |
| 8 | Audioslave |
| 9 | BackBeat |
In [57]:
word = '%rock%'
%sql select * from Album WHERE Title LIKE :word;
Done.
Out[57]:
| AlbumId | Title | ArtistId |
|---|---|---|
| 1 | For Those About To Rock We Salute You | 1 |
| 4 | Let There Be Rock | 1 |
| 59 | Deep Purple In Rock | 58 |
| 108 | Rock In Rio [CD1] | 90 |
| 109 | Rock In Rio [CD2] | 90 |
| 213 | Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK] | 139 |
| 216 | Hot Rocks, 1964-1971 (Disc 1) | 142 |
Convert to pandas dataframe¶
In [58]:
df = result.DataFrame()
df.head(5)
Out[58]:
|   | AlbumId | Title | ArtistId |
|---|---|---|---|
| 0 | 1 | For Those About To Rock We Salute You | 1 |
| 1 | 2 | Balls to the Wall | 2 |
| 2 | 3 | Restless and Wild | 2 |
| 3 | 4 | Let There Be Rock | 1 |
| 4 | 5 | Big Ones | 3 |